# Locating Data in DP2 Using XPath XPath, short for XML Path Language, is a language designed for locating information within XML and HTML documents. It facilitates the navigation through elements and attributes, proving indispensable for precise data retrieval in data extraction tasks. ### Basics of XPath XPath expressions allow for the selection of nodes, including elements, attributes, text, and more. Below are basic XPath expressions and their functions: - `//element`: Selects all nodes named `element` within the document. - `/element`: Targets all `element` nodes directly under the root node. - `element[@attribute]`: Finds all `element` nodes with a specific attribute. - `element[@attribute='value']`: Chooses all `element` nodes where the attribute equals `value`. - `element/text()`: Retrieves the text content of `element` nodes. - `element/child::node()`: Selects the child nodes of `element`. ### Advanced Usage XPath's capabilities extend to using logical operators (`and`, `or`), axes (`ancestor`, `descendant`, `following-sibling`), and functions (`contains()`, `starts-with()`, `not()`) for crafting complex queries. 1. **Logical Operators:** ```xpath //input[@type='submit' or @type='button'] ``` This selects all `input` elements with a `type` attribute of either 'submit' or 'button'. 2. **Using Axes:** ```xpath //div/ancestor::form ``` This expression finds `form` ancestors of `div` elements. 3. **Applying Functions:** ```xpath //h2[contains(text(),'News')] ``` It selects `h2` elements containing the text 'News'. ### Applying XPath in DP2 In DP2, XPath expressions are specified as data selectors for precise data extraction. For instance: ```json { "elements": { "postTitle": { "col": "//div[contains(@class, 'post-title')]/text()", "type": "string" }, "link": { "col": "//a/@href", "type": "string" } } } ``` Here, `postTitle` is configured to extract text from `div` elements with 'post-title' class, and `link` extracts the `href` attribute from all links. Here are some additional practical examples of using XPath to extract specific types of information: ### [Extracting Drug Information in ` detail_step `:](Jexter%20Configuration:Extracting%20Drug%20Information%20in%20'detail_step'.md) - **Select `` elements with the class 'drug-name':** ```xpath //td[@class='drug-name'] ``` - **Select `` elements with the class 'approval-number':** ```xpath //td[@class='approval-number'] ``` - **Select `` elements with the class 'company-name':** ```xpath //td[@class='company-name'] ``` - **Select `` elements within `
` elements with the class 'attachments':** ```xpath //div[@class='attachments']/a ``` - **Select `

` elements within `

` elements with the class 'reference':** ```xpath //div[@class='reference']/p ``` ### [Extract the Total Number of Pages in `totalpage_step`:](Jexter%20Configuration:Extract%20the%20Total%20Number%20of%20Pages%20in%20`totalpage_step`.md) - **Locate the last link in pagination (excluding "Next" and "Last"):** ```xpath //div[@class='pagination']//a[5] ``` - **Extract total page count from `` within pagination info:** ```xpath //div[@class='pagination']/span[@class='page-info'] ``` - **Extract total page count from pagination script in static pages:** ```xpath //div[@class='pagination']/script ``` ### [Extracting the Category in `category_step`:](Jexter%20Configuration:Extracting%20the%20Category%20in%20'category_step'.md) - **Extracting Category Name from Parent Box Element:** ```xpath //div[contains(@class, 'p_parentBox')]/a/text() ``` - **Extracting Category Name from Form Middle Content:** ```xpath //div[contains(@class, 'formMiddleContent482')]//a/text() ``` - **Extracting Category Name from Web Component Menu:** ```xpath //div[@class='w-com-menu-in']/ul/li/div/a/text() ``` ### [Extract Page Information in `list_step`:](Jexter%20Configuration:Extract%20Page%20Information%20in%20the%20list_step%20.md) - **Select `
  • ` elements within `